Assignment:

  1. How did you choose a model for EM? Evaluate the model performance.

  2. Cluster some of your data using EM based clustering that you also used for k-means, PAM, and hierarchical clustering. How do the clustering approaches compare on the same data?

In assignment M4L2, I chose the Breast Cancer Wisconsin (Prognostic) data set.

Loading the data:

# Alternative: read the file directly from the UCI repository
# data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.data'
# cancer_data <- read.table(url(data_url), sep = ',')

# Read the local copy of the data
cancer_data <- read.table('wpbc.data.txt', sep = ",")

names(cancer_data) <- c(
  'ID number', 'Outcome', 'Time',
  'radius_mean', 'texure_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
  'compactness_mean', 'concavity_mean', 'concave_points_mean', 'symmetry_mean',
  'fractal_dimension_mean',
  'radius_SE', 'texure_SE', 'perimeter_SE', 'area_SE', 'smoothness_SE',
  'compactness_SE', 'concavity_SE', 'concave_points_SE', 'symmetry_SE',
  'fractal_dimension_SE',
  'radius_worst', 'texure_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst',
  'compactness_worst', 'concavity_worst', 'concave_points_worst', 'symmetry_worst',
  'fractal_dimension_worst',
  'tumor_size', 'lymph_node_status'
)
cancer <- data.frame(cancer_data[,4:13])
head(cancer)
##   radius_mean texure_mean perimeter_mean area_mean smoothness_mean
## 1       18.02       27.60         117.50    1013.0         0.09489
## 2       17.99       10.38         122.80    1001.0         0.11840
## 3       21.37       17.44         137.50    1373.0         0.08836
## 4       11.42       20.38          77.58     386.1         0.14250
## 5       20.29       14.34         135.10    1297.0         0.10030
## 6       12.75       15.29          84.60     502.7         0.11890
##   compactness_mean concavity_mean concave_points_mean symmetry_mean
## 1           0.1036         0.1086             0.07055        0.1865
## 2           0.2776         0.3001             0.14710        0.2419
## 3           0.1189         0.1255             0.08180        0.2333
## 4           0.2839         0.2414             0.10520        0.2597
## 5           0.1328         0.1980             0.10430        0.1809
## 6           0.1569         0.1664             0.07666        0.1995
##   fractal_dimension_mean
## 1                0.06333
## 2                0.07871
## 3                0.06010
## 4                0.09744
## 5                0.05883
## 6                0.07164
str(cancer)
## 'data.frame':    198 obs. of  10 variables:
##  $ radius_mean           : num  18 18 21.4 11.4 20.3 ...
##  $ texure_mean           : num  27.6 10.4 17.4 20.4 14.3 ...
##  $ perimeter_mean        : num  117.5 122.8 137.5 77.6 135.1 ...
##  $ area_mean             : num  1013 1001 1373 386 1297 ...
##  $ smoothness_mean       : num  0.0949 0.1184 0.0884 0.1425 0.1003 ...
##  $ compactness_mean      : num  0.104 0.278 0.119 0.284 0.133 ...
##  $ concavity_mean        : num  0.109 0.3 0.126 0.241 0.198 ...
##  $ concave_points_mean   : num  0.0706 0.1471 0.0818 0.1052 0.1043 ...
##  $ symmetry_mean         : num  0.186 0.242 0.233 0.26 0.181 ...
##  $ fractal_dimension_mean: num  0.0633 0.0787 0.0601 0.0974 0.0588 ...

The breast cancer data set has 35 columns. Two of them are factors and the others are numeric. Here I use columns 4-13, the ten mean-value features, as the clustering data.

library("mclust")
## Package 'mclust' version 5.2
## Type 'citation("mclust")' for citing this R package in publications.
em_clust <- Mclust(cancer)
em_clust 
## 'Mclust' model object:
##  best model: ellipsoidal, equal shape and orientation (VEE) with 2 components
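
Mclust was called here without G or modelNames, so it searches its default range of component counts and covariance structures and keeps the candidate with the highest BIC. A minimal sketch of that search, split into two explicit steps, is shown below. It uses the built-in iris measurements as stand-in data (an assumption, so that the example runs without wpbc.data.txt), and the object names x, bic_grid and fit are illustrative only.

```r
library(mclust)

# Stand-in data (assumption): the four numeric columns of the built-in iris set.
x <- iris[, 1:4]

# Step 1: compute BIC for every candidate (covariance model, number of components).
bic_grid <- mclustBIC(x)
summary(bic_grid)   # lists the best candidates by BIC

# Step 2: refit from the precomputed grid; Mclust keeps the top candidate.
fit <- Mclust(x, x = bic_grid)
fit$modelName       # covariance structure of the selected model
fit$G               # number of mixture components selected
```

Replacing x with cancer should reproduce the selection reported above, since em_clust was fitted with the same defaults.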

1. How did you choose a model for EM? Evaluate the model performance.

Following the tutorial on mages’ blog, I use the fitdistrplus package to fit a normal distribution to each of the 10 columns.

library("fitdistrplus")
## Loading required package: MASS
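for (i in seq_along(cancer)) {
  # Fit a normal distribution to column i by maximum likelihood
  fit <- fitdist(cancer[, i], distr = "norm", method = "mle", discrete = FALSE)
  cat("This is column ", i, "\n")
  # Diagnostic plots: density, Q-Q, CDF, P-P
  plot(fit)
}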
## This is column  1

## This is column  2

## This is column  3

## This is column  4

## This is column  5

## This is column  6

## This is column  7

## This is column  8

## This is column  9

## This is column  10

Plotting a fitdist object produces four graphs: the density plot, the Q-Q plot, the CDF, and the P-P plot. Judging by the Q-Q plots, the data in these 10 columns are approximately normally distributed, so I choose a mixture of Gaussians as the model.
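
Q-Q plots are a visual check. One possible numeric complement is the Shapiro-Wilk test from base R, where a small p-value argues against normality. The sketch below runs it on synthetic columns (hypothetical data, one normal and one skewed) so that it executes without the data file; the names demo and p_values are illustrative.

```r
set.seed(1)

# Synthetic stand-in data (assumption): 198 rows, to mirror the sample size used here.
demo <- data.frame(
  normal_like = rnorm(198, mean = 17, sd = 3),
  skewed      = rexp(198, rate = 1)
)

# Shapiro-Wilk p-value for each column; the null hypothesis is normality.
p_values <- sapply(demo, function(col) shapiro.test(col)$p.value)
print(signif(p_values, 3))
```

Passing cancer instead of demo would apply the same check to the ten columns used for clustering.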

#evaluate the model performance
summary(em_clust)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust VEE (ellipsoidal, equal shape and orientation) model with 2 components:
## 
##  log.likelihood   n df      BIC      ICL
##        1290.914 198 77 2174.631 2158.937
## 
## Clustering table:
##   1   2 
##  50 148
# BIC
plot(em_clust, what = "BIC")
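
The BIC that mclust reports uses the convention 2 * log-likelihood - df * log(n), so larger values are better. As a check, the sketch below recomputes it from the values in the summary and recounts the 77 free parameters of a two-component VEE model on 10 variables. The parameter breakdown assumes the usual VEE structure (one volume per component, shared shape and orientation).

```r
# Values copied from summary(em_clust)
loglik <- 1290.914
n      <- 198
G      <- 2     # mixture components
d      <- 10    # variables

# Free parameters of a VEE model (assumed breakdown)
df <- (G - 1) +          # mixing proportions
      G * d +            # component means
      G +                # one volume per component
      (d - 1) +          # shared shape
      d * (d - 1) / 2    # shared orientation

bic <- 2 * loglik - df * log(n)
df               # 77
round(bic, 2)    # 2174.63
```

Both numbers agree with the df and BIC columns of the summary table.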

#classification
plot(em_clust, what = "classification")

#uncertainty
plot(em_clust, what = "uncertainty")

#density
plot(em_clust, what = "density")

# ICL
ICL <- mclustICL(cancer)
summary(ICL)
## Best ICL values:
##             VEE,2      VEE,6      VEE,3
## ICL      2158.937 2122.20671 2120.16562
## ICL diff    0.000  -36.73064  -38.77172
plot(ICL)
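
ICL is BIC with an extra penalty for uncertain assignments: for each observation it adds twice the log of its largest membership probability, which is zero for a certain assignment and negative otherwise. For the VEE,2 model the gap is 2174.631 - 2158.937 = 15.694. The helper below sketches that relationship as I understand mclust's definition; icl_from_z is a hypothetical name and the 3 x 2 matrix z is made-up toy input.

```r
# ICL from a BIC value and an n x G matrix of membership probabilities z.
icl_from_z <- function(bic, z) {
  top <- apply(z, 1, max)      # largest membership probability in each row
  bic + 2 * sum(log(top))      # log(1) = 0, so certain assignments add nothing
}

# Toy membership matrix (made up for illustration).
z <- rbind(c(1.0, 0.0),
           c(0.9, 0.1),
           c(0.2, 0.8))
icl_toy <- icl_from_z(bic = 100, z = z)

# Gap implied by the BIC and ICL values reported for VEE,2.
gap <- 2174.631 - 2158.937          # 15.694
avg_top <- exp(-gap / (2 * 198))    # geometric mean of the top probabilities, about 0.96
```

If this reading is right, icl_from_z(em_clust$bic, em_clust$z) should match the ICL value in the summary, and a typical observation is assigned to its cluster with probability of roughly 0.96.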

  2. Cluster some of your data using EM based clustering that you also used for k-means, PAM, and hierarchical clustering. How do the clustering approaches compare on the same data?